Assignment A2 - Visualization¶

Course "Data processing and Visualization", IE500417, NTNU. Fall 2022

https://www.ntnu.edu/studies/courses/IE500417

Note: plagiarism is strictly forbidden! You should never copy any source code from other students. If you use any code written by others (except standard libraries such as NumPy, SciPy, and Pandas), provide a reference.

If the teachers see that your work is mostly copy+paste from online code snippets, the grade can be reduced.

If a case of plagiarism is detected, it will be reported to the administration.

Task description¶

The goal of this assignment is to get familiar with visualization options.

You can choose whichever visualization tools work best for you. Suggested toolset: either Plotly or Matplotlib. Check the notebooks on Blackboard Week 02 with examples.

Submission details (Same as A1)¶

The assignment must be handed in on Blackboard. The following must be handed in:

  1. Report in HTML or PDF format describing the results of this assignment. Preferably, it is generated from the Jupyter notebook you used (Hint: In Jupyter: File > Download as > HTML). Alternatively (if you use plain Python or other tools), prepare a readable report that contains figures and source code snippets necessary to understand your work.
  2. Source code that you used to generate the results. This could be the Jupyter notebook file, Python source files, Matlab files, etc.

Deadlines and grading information on Blackboard.

Detailed steps¶

First the imports

In [1]:
import pandas as pd
from pathlib import Path
import math
import warnings

from pandas.errors import SettingWithCopyWarning  # public location since pandas 1.5; older versions: pandas.core.common

warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

Now do the imports necessary for your visualization tools (Matplotlib, Plotly, or whatever you use)

TASK FOR YOU:

In [2]:
import plotly.express as px
import plotly.graph_objects as go

We will use a data set on Olympic Games results. The data set comes from Kaggle. Hint: you can look at data processing solutions by other people there. Just remember that this work is your individual contribution!

In [3]:
DATA_DIR = Path("../data")
olympics = pd.read_csv(DATA_DIR / "athlete_events.csv")
olympics.head()
# olympics["Games"].unique()
Out[3]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
0 1 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball NaN
1 2 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 Summer London Judo Judo Men's Extra-Lightweight NaN
2 3 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Summer Antwerpen Football Football Men's Football NaN
3 4 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 500 metres NaN

Task 1: Inspect the columns¶

Whenever you work with data, it is important to first find out what kind of data you are working with. How many rows are there, how many columns? What are the column types, and are there any empty values?

One thing that you can always do: get quick statistics using the .describe() method. That gives the first impression of the values.

In [4]:
olympics.describe()
Out[4]:
ID Age Height Weight Year
count 271116.000000 261642.000000 210945.000000 208241.000000 271116.000000
mean 68248.954396 25.556898 175.338970 70.702393 1978.378480
std 39022.286345 6.393561 10.518462 14.348020 29.877632
min 1.000000 10.000000 127.000000 25.000000 1896.000000
25% 34643.000000 21.000000 168.000000 60.000000 1960.000000
50% 68205.000000 24.000000 175.000000 70.000000 1988.000000
75% 102097.250000 28.000000 183.000000 79.000000 2002.000000
max 135571.000000 97.000000 226.000000 214.000000 2016.000000

Another typical issue: missing values. In the snapshot above we see that some cells have the value NaN. It means "not a number", i.e., the value is missing.

Pandas DataFrame has method .isnull(). Use it to find out which columns have empty values.

TASK FOR YOU:

In [5]:
olympics.isnull().sum()
Out[5]:
ID             0
Name           0
Sex            0
Age         9474
Height     60171
Weight     62875
Team           0
NOC            0
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     231333
dtype: int64

Comment: This tells us that Age, Height, Weight and Medal have null values.

The table above shows the number of null values per column. .isnull() returns a boolean value for each cell, and .sum() treats those booleans as 0 and 1, adding them up to give the number of null values in each column.
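As a toy illustration of that mechanism (the small frame below is made up, not the Olympic data):

```python
import numpy as np
import pandas as pd

# .isnull() marks each cell True/False; .sum() treats True as 1,
# yielding the number of missing values per column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, "y"]})
print(toy.isnull().sum())  # a: 1, b: 1
```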

Task 2: Clarify attribute types¶

Before you visualize data, you should check what types the attributes have. Your task: inspect all the columns and for each of them clarify: is this attribute quantitative, ordinal or categorical?

Store the names of all categorical attributes in a list and call it categories. In a similar fashion, store ordinal attribute names in a variable named ordinals and quantitative attribute names in quants.

For example, if you think that attributes age and height are categorical while year and team are quantitative, your variables should be:

categories = ["Age", "Height"]
quants = ["Year", "Team"]
ordinals = []

TASK FOR YOU:

In [6]:
categories = ["ID", "Name", "Sex", "Team", "NOC", "Games", "Season", "City", "Sport", "Event"]
quants = ["Age", "Height", "Weight"]
ordinals = ["Year", "Medal"]

Task 3: Display Nordic gold¶

In this step you will visualize gold medals for Nordic countries. First, we select data for each country (this is just one option, there are many ways to do it).

In [7]:
# Comment: This returns a filter: a boolean Series with a True/False value for each row, True where the country code is "NOR":
# olympics["NOC"] == "NOR"
# This selects only those rows from the dataframe, where the filter says "True" in the corresponding row:
# olympics[olympics["NOC"] == "NOR"]
# etc...

nor = olympics[olympics["NOC"] == "NOR"]
swe = olympics[olympics["NOC"] == "SWE"]
den = olympics[olympics["NOC"] == "DEN"]
fin = olympics[olympics["NOC"] == "FIN"]
isl = olympics[olympics["NOC"] == "ISL"]

Task 3.1: Visualize total number of gold medals per Nordic country, in a single plot.

Use a single figure that shows these two attributes: a country and number of medals (for a particular country). When you choose the type of chart, keep in mind the attribute types that you are visualizing!

TASK FOR YOU:

In [8]:
dic_country_vs_medals = {
    "Norway" : nor,
    "Sweden" : swe,
    "Denmark" : den,
    "Finland" : fin,
    "Iceland" : isl
}
keys = dic_country_vs_medals.keys()

def find_total_medals(key):
    # Count the rows where Medal is not missing, i.e. where a medal was won
    return dic_country_vs_medals[key]["Medal"].notna().sum()


fig = px.bar(
    x=keys, 
    y=[find_total_medals(key) for key in keys], 
    title="Number of medals won by Nordic Countries", 
    labels={"y": "Total number of medals", "x": "Nordic Countries"},
    height=600
)
fig.update_traces(width=0.3)
fig.layout.template = "plotly_dark"
fig.show()

The next task is to show one more attribute: Season. You should show the total number of gold medals per Nordic country, separated by season, i.e., how many gold medals Norway has won in the Summer Olympics and how many in the Winter Olympics, how many gold medals Sweden has won in each season, etc. Think about the "Season" attribute: what type does it have, and what channel would be appropriate to visualize it?

Task 3.2: Visualize total number of gold medals per Nordic country, per season.

TASK FOR YOU:

In [9]:
# Select the rows for the Nordic countries
gold_medals_per_season = olympics[olympics["NOC"].isin(["NOR", "SWE", "DEN", "FIN", "ISL"])]
gold_medals_per_season["Gold Medal"] = gold_medals_per_season["Medal"] == "Gold"
# Sum only the boolean "Gold Medal" column; summing the string columns would fail in newer pandas
gold_medals_per_season = gold_medals_per_season.groupby(["NOC", "Season"], as_index=False)["Gold Medal"].sum()
gold_medals_per_season["NOC"] = gold_medals_per_season["NOC"].replace(["NOR", "SWE", "DEN", "FIN", "ISL"], ["Norway", "Sweden", "Denmark", "Finland", "Iceland"])

fig = px.bar(
    gold_medals_per_season, 
    x="NOC", 
    y="Gold Medal",
    labels={
     "Gold Medal" : "Number of Gold Medals",
     "NOC": "Nordic Countries"
    },
    color='Season', barmode='stack',
    height=600,
    title="Number of gold medals won by Nordic Countries in Summer and Winter Olympics", 
)
fig.update_traces(width=0.3)
fig.layout.template = "plotly_dark"
fig.show()

Task 3.3: Reflect/explain: why did you choose these chart types? What marks and channels are they using to convey the information of each attribute?

TASK FOR YOU:

We need to show one quantitative attribute (the number of gold medals) against two categorical attributes (country and season). With these requirements in mind, a stacked bar chart makes the most sense, since we wish to:

  • compare the number of gold medals won by the different Nordic countries per season, and
  • compare the total number of gold medals won by each country in the same plot.

Note: Sorting the bars is not needed, since we are dealing with categorical data whose order has no relevance.

Task 4: Visualize trend¶

Visualize number of women participating in the Olympics (with or without medals) over the years in a single chart.

TASK FOR YOU:

In [10]:
women_participation = olympics[olympics["Sex"] == "F"].groupby(by=["Year"], as_index=False).count()
fig = px.line(
    women_participation, 
    y="Sex", 
    x="Year", 
    labels={
        "Sex": "Number of Women"
    },
    height=600,
    markers=True,
    title="Number of women participating in the olympics over the years", 
)
fig.layout.template = "plotly_dark"
fig.update_layout(
    xaxis = dict(
        dtick = 4
    )
)
fig.update_xaxes(
    tickangle = -45,
)
fig.show()

TASK FOR YOU: The chart will probably have some weird spikes towards the end. Do some analysis to understand what they mean and why they appear.

The line plot above, when plotted with dot marks, shows a change in the frequency with which the Olympics have been held over the years. Notice that before 1992, the event was held roughly every 4 years, but after 1992 it seems to have been held every 2 years. This is strange, because each edition of the Olympics is actually held every 4 years. According to Wikipedia, 1992 was the last year in which the Winter and Summer Olympics were held in the same year. This explains why the spikes occur after 1992, and why there are marks every 2 years instead of 4. To confirm this, we can plot the same graph with the data split by season. If our assumption holds, each line should be smooth.

In [11]:
women_participation_by_season = olympics[olympics["Sex"] == "F"].groupby(by=["Year", "Season"], as_index=False).count()
fig = px.line(
    women_participation_by_season, 
    y="Sex", 
    x="Year", 
    labels={
        "Sex": "Number of Women"
    },
    height=600,
    color="Season",
    markers=True,
    title="Number of women participating in the olympics over the years split by season", 
)
fig.layout.template = "plotly_dark"
fig.update_layout(
    xaxis = dict(
        dtick = 4
    )
)
fig.update_xaxes(
    tickangle = -45,
)
fig.show()

TASK FOR YOU: Find a reasonable way to aggregate the data to get the overall trend over the years. I.e., get rid of the spikes.

In [12]:
women_participation_before_and_in_1992 = olympics[
    (olympics["Sex"] == "F") & 
    (olympics["Year"] <= 1992)
].groupby(
    by=["Year"], 
    as_index=False
).count()

women_participation_after_1992_winter = olympics[
    (olympics["Sex"] == "F") & 
    (olympics["Year"] > 1992) & 
    (olympics["Season"] == "Winter")
].groupby(
    by=["Year"], 
    as_index=False
).count()

women_participation_after_1992_summer = olympics[
    (olympics["Sex"] == "F") & 
    (olympics["Year"] > 1992) & 
    (olympics["Season"] == "Summer")
].groupby(
    by=["Year"], 
    as_index=False
).count()

# Adding 2 to year column (to align winter and summer together)
women_participation_after_1992_winter["Year"] = women_participation_after_1992_winter["Year"] + 2

# Adding the counts of summer and winter together
adjusted_post_1992_women_participation = pd.concat(
    [
        women_participation_after_1992_summer, 
        women_participation_after_1992_winter
    ]
).groupby(
    by = ["Year"], 
    as_index=False
).sum()

# Lastly we join it back to pre 1992 data
adjusted_women_participation = pd.concat(
    [
        women_participation_before_and_in_1992, 
        adjusted_post_1992_women_participation
    ]
).groupby(
    by = ["Year"], 
    as_index=False
).sum()

fig = px.line(
    adjusted_women_participation, 
    y="Sex", 
    x="Year", 
    labels={
        "Sex": "Number of Women"
    },
    height=600,
    markers=True,
    title="Number of women participating in the olympics over the years (Adjusted after 1992)", 
)
fig.layout.template = "plotly_dark"
fig.update_layout(
    xaxis = dict(
        dtick = 4
    )
)
fig.update_xaxes(
    tickangle = -45,
)
fig.show()

Your reflection here - Explain why your approach makes sense.
As seen from the previous graph, the Winter and Summer Olympics were staggered after 1992 and have been held in different years since then (2 years after each other). Thus, to get the correct overall trend of women's participation, one aggregation that makes sense is to align the Winter Olympics with the Summer Olympics held after 1992. By alignment, I mean we can add the number of women participating in the Winter Games (W + 2 = S) to the corresponding Summer Games (S). This also gives us the commonly known 4-year gap between data points after 1992. Plotting the graph (as shown above) shows the correct participation trend without the spikes.
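The same W + 2 = S shift can also be done in one vectorized step: the staggered Winter years (1994, 1998, ...) are all ≡ 2 (mod 4), so they can be detected and shifted onto the following Summer year in a single expression. A minimal sketch with made-up counts:

```python
import numpy as np
import pandas as pd

# Made-up participation counts per Games year (post-1992 is staggered)
counts = pd.Series(
    [100, 120, 140, 80, 150, 90, 160],
    index=[1984, 1988, 1992, 1994, 1996, 1998, 2000],
)

# Winter years after 1992 satisfy year % 4 == 2; shift them onto the
# following Summer year, then sum the merged counts per aligned year
aligned_year = counts.index + np.where(counts.index % 4 == 2, 2, 0)
smoothed = counts.groupby(aligned_year).sum()
print(smoothed)  # 1984: 100, 1988: 120, 1992: 140, 1996: 230, 2000: 250
```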

Task 5: Visualize statistics¶

Sometimes you want to look at the distribution of values for an attribute. There are several choices; a histogram is one of them. The idea of a histogram: we split the value range into intervals of a specific size and count how many values fall within each interval.
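The binning idea above can be sketched with NumPy alone (the weight values below are made up):

```python
import numpy as np

# Made-up weight values
values = np.array([55, 61, 62, 64, 70, 71, 72, 72, 73, 90])

# Bin edges define the intervals [50,60), [60,70), [70,80), [80,90];
# np.histogram counts how many values fall into each interval
counts, edges = np.histogram(values, bins=[50, 60, 70, 80, 90])
print(counts)  # [1 3 5 1]
```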

Task 5.1: Show histogram of weight for all gold medal winners in the WINTER season.

TASK FOR YOU:

In [13]:
bin_width = 1 # To ensure that the two histograms are comparable

# Note: comparing Medal with "Gold" is False for NaN cells, so missing medals are excluded automatically
winter_weight_data = olympics[(olympics["Season"] == "Winter") & (olympics["Medal"] == "Gold")]
summer_weight_data = olympics[(olympics["Season"] == "Summer") & (olympics["Medal"] == "Gold")]

max_weight = max(winter_weight_data["Weight"].max(), summer_weight_data["Weight"].max())
min_weight = 0

nbins = math.ceil((max_weight - min_weight) / bin_width)

Reasoning for bins and scale¶

To keep the histograms comparable, we need to make sure that the scales match and the bin widths are the same. This can be done by taking the maximum weight over both dataframes and using [0, max_weight] as a common x-axis range for both histograms. Using this range and our pre-decided bin_width = 1, we can compute the number of bins (nbins). This ensures that the bin widths are the same, and using the same range also ensures that the bins start and end at the same positions (this can be verified by zooming in). The y-axis ranges also need to match: since we are interested in comparison, differing y-scales would be misleading. There are two ways to achieve this: define a scale individually for each histogram (a cognitive overload for the audience), or simply set the same y-axis range so the graphs automatically match in scale (provided they have the same height).

In [14]:
fig = px.histogram(
    winter_weight_data, 
    x="Weight", 
    height=600, 
    nbins=nbins,
    title="Weight distribution of gold medalists in winter olympics",
    range_x=[min_weight, max_weight],
    range_y=[0, 400]
)
fig.update_layout(
    bargap= 0.1
)
fig.layout.template = "plotly_dark"
fig.show()

Task 5.2: Show histogram of weight for all gold medal winners in the SUMMER season. Note: the histogram should be comparable with the WINTER histogram.

TASK FOR YOU:

In [15]:
fig = px.histogram(
    summer_weight_data, 
    x="Weight", 
    height=600, 
    nbins=nbins,
    title="Weight distribution of gold medalists in summer olympics",
    range_x=[min_weight, max_weight],
    range_y=[0, 400]
)
fig.update_layout(
    bargap= 0.1
)

fig.layout.template = "plotly_dark"
fig.show()

Task 6: Visualize trend of statistics (a bit more challenging)¶

In this task you will visualize how one attribute varies over time. But instead of visualizing a single value, we will look at the statistics.

First, we calculate Body Mass Index for all participants. Formula: BMI = Weight / (Height * Height)

In [16]:
olympics["BMI"] = olympics["Weight"] / (olympics["Height"] ** 2)

Your task is to group the entries by year, for each year show the statistics of BMI: the mean BMI value and the range of values. There are different options available, including:

  • Show a box plot for each year
  • Show Bollinger Bands® for the BMI variable. Bollinger Bands are typically used in finance, but they use statistical concepts at their core: a moving average line shown in the middle, with upper and lower bands shown as the mean +/- two standard deviations.
  • Be creative! :)

Note: the idea is to show this in a single plot. Don't create a separate box plot for every year.

Some inspirational examples from a Kaggle solution by Marco Giuseppe de Pinto:

TASK FOR YOU:

In [17]:
olympics = olympics.dropna()  # note: this drops every row with any missing value (e.g. all non-medalists); dropna(subset=["BMI"]) would keep more rows

fig = go.Figure([
    go.Box(y=olympics["BMI"], x=olympics["Year"], boxmean='sd', boxpoints = False)
])
fig.update_layout(
    xaxis = dict(
        dtick = 4
    ),
    title="BMI statistics of Olympics over the years",
    yaxis_title="BMI",
    xaxis_title="Year",
    height=600
)
fig.update_xaxes(
    tickangle = -45,
)
fig.layout.template = "plotly_dark"
fig.show()

Reflection¶

Please reflect on the following questions:

  1. How did the assignment go? Was it easy or hard?
    A. This was slightly more challenging than the last one.

  2. How many hours did you spend on it?
    A. 3 hrs

  3. What was the most time-consuming part?
    A. Finding the right aggregate and presenting the statistics in a creative way.

  4. If you need to do similar things later in your professional life, how can you improve? How can you do it more efficiently?
    A. In trend (line) graphs, the lines could be misleading and hide information so from here on I will make sure to plot the dot markers that generate those lines as well to get a better picture.

  5. Was there something you would expect to learn that this exercise did not include?
    A. No, it covered the week 5 lecture well, and I had to go back and forth between the lecture notes to make sure I was on the right track.

  6. Was there something that does not make sense?
    A. Nothing as such